Abstract

We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate 
geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights 
that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest 
descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is 
easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. 


1 Introduction 

Training deep networks is a challenging problem [?, 2], and various heuristics and optimization algorithms
have been suggested to improve the efficiency of training [5, 9, 4]. However, training deep
architectures is still considerably slow and the problem remains open. Many current training
methods rely on good initialization followed by Stochastic Gradient Descent (SGD), sometimes
together with an adaptive stepsize or momentum term [?, 1, 6].

Revisiting the choice of gradient descent, we recall that optimization is inherently tied to a choice of
geometry or measure of distance, norm or divergence. Gradient descent, for example, is tied to the $\ell_2$ norm, as it
is the steepest descent with respect to the $\ell_2$ norm in the parameter space, while coordinate descent corresponds
to steepest descent with respect to the $\ell_1$ norm, and exp-gradient (multiplicative weight) updates are tied to
an entropic divergence. Moreover, at least when the objective function is convex, convergence behavior is
tied to the corresponding norms or potentials. For example, with gradient descent, or SGD, convergence
speeds depend on the $\ell_2$ norm of the optimum. The norm or divergence can be viewed as a regularizer for
the updates. There is therefore also a strong link between regularization for optimization and regularization
for learning: optimization may provide implicit regularization in terms of its corresponding geometry, and
for ideal optimization performance the optimization geometry should be aligned with the inductive bias driving
the learning [?].
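
As a worked illustration of this link (a sketch in our own notation: $g$ stands for the gradient at the current point and $\eta$ for the step-size, both assumed symbols), steepest descent with respect to a norm $\|\cdot\|$ can be written as a proximal step, and the two norms mentioned above recover gradient descent and coordinate descent respectively:

```latex
\begin{align*}
\Delta^\star &= \arg\min_{\Delta}\ \eta\,\langle g, \Delta\rangle + \tfrac{1}{2}\|\Delta\|^2 \\
\|\cdot\| = \|\cdot\|_2 &:\quad \Delta^\star = -\eta\, g \\
\|\cdot\| = \|\cdot\|_1 &:\quad \Delta^\star = -\eta\,\|g\|_\infty\,\operatorname{sign}(g_{i^\star})\, e_{i^\star},
\qquad i^\star \in \arg\max_i |g_i|
\end{align*}
```

In the $\ell_1$ case only the coordinate with the largest gradient magnitude moves, which is a coordinate descent step; when several coordinates tie, the minimizer is not unique.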

Is the $\ell_2$ geometry on the weights the appropriate geometry for the space of deep networks? Or can we
suggest a geometry with more desirable properties that would enable faster optimization and perhaps also 
better implicit regularization? As suggested above, this question is also linked to the choice of an appropriate 
regularizer for deep networks. 

Focusing on networks with RELU activations, we observe that scaling down the incoming edges to a 
hidden unit and scaling up the outgoing edges by the same factor yields an equivalent network computing 
the same function. Since predictions are invariant to such rescalings, it is natural to seek a geometry, and 
corresponding optimization method, that is similarly invariant. 
We consider here a geometry inspired by max-norm regularization (regularizing the maximum norm of
incoming weights into any unit), which seems to provide a better inductive bias compared to the $\ell_2$ norm
(weight decay) [?, 15]. But to achieve rescaling invariance, we use not the max-norm itself, but rather the
minimum max-norm over all rescalings of the weights. We discuss how this measure can be expressed as a
“path regularizer” and can be computed efficiently.

We therefore suggest a novel optimization method, Path-SGD, that is an approximate steepest descent
method with respect to path regularization. Path-SGD is rescaling-invariant, and we demonstrate that
Path-SGD outperforms gradient descent and AdaGrad for classification tasks on several benchmark datasets.

Notations. A feedforward neural network that computes a function $f : \mathbb{R}^D \to \mathbb{R}^C$ can be represented
by a directed acyclic graph (DAG) $G(V, E)$ with $D$ input nodes $v_{in}[1], \dots, v_{in}[D] \in V$, $C$ output nodes
$v_{out}[1], \dots, v_{out}[C] \in V$, weights $w : E \to \mathbb{R}$ and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$ that is applied on the
internal nodes (hidden units). We denote the function computed by this network as $f_{G,w,\sigma}$. In this paper we
focus on the RELU (REctified Linear Unit) activation function $\sigma_{RELU}(x) = \max\{0, x\}$. We refer to the depth
$d$ of the network, which is the length of the longest directed path in $G$. For any $0 \le i \le d$, we define $V^i_{in}$ to
be the set of vertices with longest path of length $i$ to an input unit, and $V^i_{out}$ is defined similarly for paths to
output units. In layered networks $V^i_{in} = V^{d-i}_{out}$ is the set of hidden units in hidden layer $i$.

2 Rescaling and Unbalanceness 

One of the special properties of the RELU activation function is non-negative homogeneity. That is, for any
scalar $c > 0$ and any $x \in \mathbb{R}$, we have $\sigma_{RELU}(c \cdot x) = c \cdot \sigma_{RELU}(x)$. This property allows the
network to be rescaled without changing the function computed by the network. We define the rescaling
function $\rho_{c,v}$ such that given the weights of the network $w : E \to \mathbb{R}$, a constant $c > 0$, and a node $v$, the
rescaling function multiplies the incoming edges and divides the outgoing edges of $v$ by $c$. That is, $\rho_{c,v}(w)$
maps $w$ to the weights $\tilde{w}$ for the rescaled network, where for any $(u_1 \to u_2) \in E$:

$$\tilde{w}(u_1 \to u_2) = \begin{cases} c \cdot w(u_1 \to u_2) & u_2 = v, \\ \frac{1}{c}\, w(u_1 \to u_2) & u_1 = v, \\ w(u_1 \to u_2) & \text{otherwise.} \end{cases} \tag{1}$$

It is easy to see that the rescaled network computes the same function, i.e. $f_{G,w,\sigma_{RELU}} = f_{G,\rho_{c,v}(w),\sigma_{RELU}}$.
We say that the two networks with weights $w$ and $\tilde{w}$ are rescaling equivalent, denoted by $w \sim \tilde{w}$, if and only
if one of them can be transformed to another by applying a sequence of rescaling functions $\rho_{c,v}$.
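
The rescaling function can be checked numerically. The following is a minimal sketch, assuming a small layered network stored as nested lists; the helper names `forward` and `rescale`, the layer sizes and the constants are our own illustrative choices and are not taken from the paper.

```python
import random

def relu(z):
    return max(0.0, z)

def forward(weights, x):
    # weights[l][j][i]: weight on the edge from unit i in layer l to unit j in layer l+1.
    h = list(x)
    for l, W in enumerate(weights):
        z = [sum(w * a for w, a in zip(row, h)) for row in W]
        # RELU on hidden layers, linear output layer
        h = z if l == len(weights) - 1 else [relu(v) for v in z]
    return h

def rescale(weights, layer, unit, c):
    # rho_{c,v} for the hidden unit v = `unit` whose incoming edges are weights[layer]
    assert c > 0 and 0 <= layer < len(weights) - 1
    new = [[row[:] for row in W] for W in weights]
    new[layer][unit] = [c * w for w in new[layer][unit]]  # incoming edges times c
    for row in new[layer + 1]:
        row[unit] /= c                                    # outgoing edges divided by c
    return new

rng = random.Random(0)
sizes = [3, 4, 4, 2]  # hypothetical layer sizes
w = [[[rng.gauss(0.0, 1.0) for _ in range(sizes[l])] for _ in range(sizes[l + 1])]
     for l in range(len(sizes) - 1)]
# A sequence of two rescalings, so w_tilde is rescaling equivalent to w.
w_tilde = rescale(rescale(w, 0, 1, 10.0), 1, 2, 0.05)

x = [0.5, -1.0, 2.0]
out, out_tilde = forward(w, x), forward(w_tilde, x)
max_gap = max(abs(a - b) for a, b in zip(out, out_tilde))
print(out, out_tilde, max_gap)  # the two outputs agree up to floating-point rounding
```

The weights of the two networks differ substantially, yet the outputs coincide because every path through a rescaled unit picks up a factor $c$ and a factor $1/c$.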

Given a training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, our goal is to minimize the following objective
function:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_w(x_i), y_i) \tag{2}$$

Let $w^{(t)}$ be the weights at step $t$ of the optimization. We consider update steps of the following form:
$w^{(t+1)} = w^{(t)} + \Delta w^{(t+1)}$. For example, for gradient descent, we have $\Delta w^{(t+1)} = -\eta \nabla L(w^{(t)})$, where $\eta$ is the
step-size. In the stochastic setting, such as SGD or mini-batch gradient descent, we calculate the gradient on a
small subset of the training set.

Since rescaling equivalent networks compute the same function, it is desirable to have an update rule that 
is not affected by rescaling. We call an optimization method rescaling invariant if the updates of rescaling 
equivalent networks are rescaling equivalent. That is, if we start at either one of the two rescaling equivalent
weight vectors $\tilde{w}^{(0)} \sim w^{(0)}$, after applying $t$ update steps separately on $\tilde{w}^{(0)}$ and $w^{(0)}$, they will remain
rescaling equivalent and we have $\tilde{w}^{(t)} \sim w^{(t)}$.

(b) Weight explosion in an unbalanced network

(c) Poor updates in an unbalanced network

Figure 1: (a): Evolution of the cross-entropy error function when training a feed-forward network on MNIST with
two hidden layers, each containing 4000 hidden units. The unbalanced initialization (blue curve) is generated by
applying a sequence of rescaling functions on the balanced initialization (red curve). (b): Updates for a simple case
where the input is $x = 1$, thresholds are set to zero (constant), the stepsize is 1, and the gradient with respect to the output
is $\delta = -1$. (c): Updated network for the case where the input is $x = (1, 1)$, thresholds are set to zero (constant), the
stepsize is 1, and the gradient with respect to the output is $\delta = (-1, -1)$.


Unfortunately, gradient descent is not rescaling invariant. The main problem with the gradient updates 
is that scaling down the weights of an edge will also scale up the gradient which, as we see later, is exactly 
the opposite of what is expected from a rescaling invariant update. 
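
To make the mismatch concrete, here is a short derivation (a sketch in our notation, for an edge $e$ entering the rescaled node $v$, with $\tilde{w} = \rho_{c,v}(w)$ and an assumed step-size symbol $\eta$). It compares a gradient step taken on the rescaled network with the rescaling of a gradient step taken on the original network:

```latex
\begin{align*}
L(w) = L(\rho_{c,v}(w)) \ \text{for all } w
  \;&\Longrightarrow\; \frac{\partial L}{\partial w_e}(w) = c\,\frac{\partial L}{\partial w_e}(\tilde{w}) \\
\text{gradient step on } \tilde{w}: \quad
  \tilde{w}_e - \eta\,\frac{\partial L}{\partial w_e}(\tilde{w})
  &= c\,w_e - \frac{\eta}{c}\,\frac{\partial L}{\partial w_e}(w) \\
\text{rescaled step on } w: \quad
  c\Big(w_e - \eta\,\frac{\partial L}{\partial w_e}(w)\Big)
  &= c\,w_e - c\,\eta\,\frac{\partial L}{\partial w_e}(w)
\end{align*}
```

The two right-hand sides agree only when $c = 1$ or the gradient vanishes: the gradient shrinks by $1/c$ exactly where an invariant update would have to grow by $c$.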

Furthermore, gradient descent performs very poorly on “unbalanced” networks. We say that a network
is balanced if the norms of incoming weights to different units are roughly the same or within a small range.
For example, Figure 1(a) shows a huge gap in the performance of SGD initialized with a randomly generated
balanced network $w^{(0)}$ when training on MNIST, compared to a network initialized with unbalanced
weights $\tilde{w}^{(0)}$. Here $\tilde{w}^{(0)}$ is generated by applying a sequence of random rescaling functions on $w^{(0)}$ (and
therefore $w^{(0)} \sim \tilde{w}^{(0)}$).

In an unbalanced network, gradient descent updates could blow up the smaller weights, while keeping
the larger weights almost unchanged. This is illustrated in Figure 1(b). If this were the only issue, one could
scale down all the weights after each update. However, in an unbalanced network, the relative changes in the
weights are also very different compared to a balanced network. For example, Figure 1(c) shows how two
rescaling equivalent networks could end up computing a very different function after only a single update.
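
The effect is easy to reproduce on a toy example. The sketch below uses a hypothetical chain with a single hidden RELU unit, $f(x) = w_2 \cdot \sigma_{RELU}(w_1 x)$; it borrows the input, stepsize and output gradient from the Figure 1(b) caption, but the network itself and the weight values are our own assumptions, not the one drawn in the figure.

```python
def relu(z):
    return max(0.0, z)

def output(w1, w2, x):
    return w2 * relu(w1 * x)

def gd_step(w1, w2, x, delta, eta):
    # One plain gradient step; delta is the gradient of the loss w.r.t. the output.
    pre = w1 * x
    g1 = delta * w2 * x if pre > 0 else 0.0  # dL/dw1 (zero if the RELU is inactive)
    g2 = delta * relu(pre)                   # dL/dw2
    return w1 - eta * g1, w2 - eta * g2

x, delta, eta = 1.0, -1.0, 1.0
c = 10.0
balanced = (1.0, 1.0)
unbalanced = (c * balanced[0], balanced[1] / c)  # rho_{c,v} applied to the hidden unit

new_bal = gd_step(*balanced, x, delta, eta)      # weights become (2.0, 2.0)
new_unb = gd_step(*unbalanced, x, delta, eta)    # weights become about (10.1, 10.1)

print(output(*balanced, x), output(*unbalanced, x))  # same function before the update
print(output(*new_bal, x), output(*new_unb, x))      # 4.0 versus about 102.01 afterwards
```

In the unbalanced network the small outgoing weight jumps from 0.1 to about 10.1 while the large incoming weight barely moves, and the two networks no longer compute the same function.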
3 Magnitude/Scale measures for deep networks

Following [?], we consider the grouping of weights going into each node of the network. This forms the
following generic group-norm type regularizer, parametrized by $1 \le p, q \le \infty$:

$$\mu_{p,q}(w) = \left( \sum_{v \in V} \left( \sum_{(u \to v) \in E} |w(u \to v)|^p \right)^{q/p} \right)^{1/q} \tag{3}$$

Two simple cases of the above group-norm are $p = q = 1$ and $p = q = 2$, which correspond to overall $\ell_1$
regularization and weight decay respectively. Another form of regularization that has been shown to be very effective in
RELU networks is max-norm regularization, which is the maximum over all units of the norm of the incoming
edges to the unit¹ [?, 15]. The max-norm corresponds to “per-unit” regularization when we set $q = \infty$ in
equation (3) and can be written in the following form:

$$\mu_{p,\infty}(w) = \sup_{v \in V} \left( \sum_{(u \to v) \in E} |w(u \to v)|^p \right)^{1/p} \tag{4}$$
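
As a small illustration of (3) and (4), the following sketch evaluates the group norm for a layered network stored as nested lists. The function name `group_norm`, the storage layout and the toy weights are our own assumptions; only finite $p$ is handled.

```python
import math

def group_norm(weights, p, q):
    # mu_{p,q}(w): weights[l][j] holds the incoming weights of unit j in layer l+1.
    # Input units have no incoming edges and therefore contribute nothing.
    unit_norms = [sum(abs(w) ** p for w in row) ** (1.0 / p)
                  for W in weights for row in W]
    if math.isinf(q):
        return max(unit_norms)  # per-unit (max-norm) case, equation (4)
    return sum(n ** q for n in unit_norms) ** (1.0 / q)

# Toy network: 2 inputs, 2 hidden units, 1 output (hypothetical values).
w = [[[3.0, 4.0], [0.0, 1.0]], [[1.0, -2.0]]]

l1_overall = group_norm(w, 1, 1)         # sum of absolute values: 11
l2_overall = group_norm(w, 2, 2)         # Euclidean norm of all weights: sqrt(31)
max_norm = group_norm(w, 2, math.inf)    # largest per-unit l2 norm: 5
print(l1_overall, l2_overall, max_norm)
```

With $p = q = 2$ the measure is the Euclidean norm of all weights, while $q = \infty$ keeps only the unit with the largest incoming norm.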


Weight decay is probably the most commonly used regularizer. On the other hand, per-unit regularization
might not seem ideal, as it is very extreme in the sense that the value of the regularizer corresponds to the
highest value among all nodes. However, the situation is very different for networks with RELU activations
(and other activation functions with the non-negative homogeneity property). In these cases, per-unit $\ell_2$
regularization has been shown to be very effective [15]. The main reason could be that RELU networks can be
rebalanced in such a way that all hidden units have the same norm. Hence, per-unit regularization will not
be a crude measure anymore.

Since $\mu_{p,\infty}$ is not rescaling invariant and the values of the scale measure are different for rescaling
equivalent networks, it is desirable to look for the minimum value of a regularizer among all rescaling
equivalent networks. Surprisingly, for a feed-forward network, the minimum $\ell_p$ per-unit regularizer among
all rescaling equivalent networks can be efficiently computed by a single forward step. To see this, we
consider the vector $\pi(w)$, the path vector, where the number of coordinates of $\pi(w)$ is equal to the total
number of paths from the input to output units and each coordinate of $\pi(w)$ is equal to the product of
weights along a path from an input node to an output node. The $\ell_p$-path regularizer is then defined as the
$\ell_p$ norm of $\pi(w)$ [?]:

$$\phi_p(w) = \|\pi(w)\|_p = \left( \sum_{v_{in}[i] \xrightarrow{e_1} v_1 \xrightarrow{e_2} v_2 \cdots \xrightarrow{e_d} v_{out}[j]} \left| \prod_{k=1}^{d} w_{e_k} \right|^p \right)^{1/p} \tag{5}$$


The following lemma establishes that the $\ell_p$-path regularizer corresponds to the minimum over all equivalent
networks of the per-unit $\ell_p$ norm:

Lemma 3.1 ([?]). $\phi_p(w) = \min_{\tilde{w} \sim w} \left( \mu_{p,\infty}(\tilde{w}) \right)^d$

¹This definition of max-norm is a bit different from the one used in the context of matrix factorization [13]. The latter is similar
to the minimum upper bound over the $\ell_2$ norm of both outgoing edges from the input units and incoming edges to the output units in a
two-layer feed-forward network.
The definition (5) of the $\ell_p$-path regularizer involves an exponential number of terms. But it can be
computed efficiently by dynamic programming in a single forward step using the following equivalent form
as nested sums:

$$\phi_p(w) = \left( \sum_{(v_{d-1} \to v_{out}[j]) \in E} |w(v_{d-1} \to v_{out}[j])|^p \sum_{(v_{d-2} \to v_{d-1}) \in E} |w(v_{d-2} \to v_{d-1})|^p \cdots \sum_{(v_{in}[i] \to v_1) \in E} |w(v_{in}[i] \to v_1)|^p \right)^{1/p}$$

A straightforward consequence of Lemma 3.1 is that the $\ell_p$ path-regularizer is invariant to rescaling, i.e.
for any $\tilde{w} \sim w$, $\phi_p(\tilde{w}) = \phi_p(w)$.
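
The equivalence between the path-sum definition and the nested-sum form can be checked on a small layered network. This is a minimal sketch under our own assumptions (nested-list weights, hypothetical helper names and layer sizes); the brute-force version enumerates every path and is only feasible for tiny networks.

```python
import itertools
import random

def path_norm_bruteforce(weights, p):
    # Enumerate every input-to-output path (exponential in the depth).
    sizes = [len(weights[0][0])] + [len(W) for W in weights]
    total = 0.0
    for path in itertools.product(*(range(s) for s in sizes)):
        prod = 1.0
        for l, W in enumerate(weights):
            prod *= W[path[l + 1]][path[l]]  # edge from path[l] to path[l+1]
        total += abs(prod) ** p
    return total ** (1.0 / p)

def path_norm_dp(weights, p):
    # Single forward pass: gamma[v] = sum over paths into v of prod |w|^p.
    gamma = [1.0] * len(weights[0][0])
    for W in weights:
        gamma = [sum(g * abs(w) ** p for g, w in zip(gamma, row)) for row in W]
    return sum(gamma) ** (1.0 / p)

def rescale(weights, layer, unit, c):
    # rho_{c,v}: incoming edges of the hidden unit times c, outgoing edges divided by c.
    new = [[row[:] for row in W] for W in weights]
    new[layer][unit] = [c * w for w in new[layer][unit]]
    for row in new[layer + 1]:
        row[unit] /= c
    return new

rng = random.Random(1)
sizes = [3, 4, 3, 2]  # hypothetical layer sizes, 72 paths in total
w = [[[rng.gauss(0.0, 1.0) for _ in range(sizes[l])] for _ in range(sizes[l + 1])]
     for l in range(len(sizes) - 1)]

bf = path_norm_bruteforce(w, p=2)
dp = path_norm_dp(w, p=2)
dp_tilde = path_norm_dp(rescale(w, 0, 2, 7.0), p=2)
print(bf, dp, dp_tilde)  # all three agree up to floating-point rounding
```

The forward pass touches each edge once, so its cost is linear in the number of edges, whereas the number of paths grows exponentially with depth.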


4 Path-SGD: An Approximate Path-Regularized Steepest Descent


Motivated by the empirical performance of max-norm regularization and the fact that the path-regularizer is
invariant to rescaling, we are interested in deriving the steepest descent direction with respect to the path
regularizer $\phi_p(w)$:

$$\begin{aligned}
w^{(t+1)} &= \arg\min_{w}\ \eta \left\langle \nabla L(w^{(t)}), w \right\rangle + \frac{1}{2} \left\| \pi(w) - \pi(w^{(t)}) \right\|_p^2 \\
&= \arg\min_{w}\ \eta \left\langle \nabla L(w^{(t)}), w \right\rangle + \frac{1}{2} \left( \sum_{v_{in}[i] \xrightarrow{e_1} v_1 \xrightarrow{e_2} v_2 \cdots \xrightarrow{e_d} v_{out}[j]} \left| \prod_{k=1}^{d} w_{e_k} - \prod_{k=1}^{d} w^{(t)}_{e_k} \right|^p \right)^{2/p} \\
&= \arg\min_{w}\ J^{(t)}(w)
\end{aligned} \tag{6}$$


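The steepest descent step (6) is hard to calculate exactly. Instead, we will update each coordinate $w_e$
independently (and synchronously) based on (6). That is:

$$w_e^{(t+1)} = \arg\min_{w_e}\ J^{(t)}(w) \quad \text{s.t.} \quad \forall e' \neq e,\ w_{e'} = w^{(t)}_{e'} \tag{7}$$

Taking the partial derivative with respect to $w_e$ and setting it to zero we obtain:

$$0 = \eta\, \frac{\partial L}{\partial w_e}(w^{(t)}) + \left( w_e - w_e^{(t)} \right) \left( \sum_{v_{in}[i] \cdots \xrightarrow{e} \cdots v_{out}[j]}\ \prod_{e_k \neq e} \left| w^{(t)}_{e_k} \right|^p \right)^{2/p}$$

where $v_{in}[i] \cdots \xrightarrow{e} \cdots v_{out}[j]$ denotes the paths from any input unit $i$ to any output unit $j$ that include $e$.
Solving for $w_e$ gives us the following update rule:

$$w_e^{(t+1)} = w_e^{(t)} - \frac{\eta}{\gamma_p(w^{(t)}, e)}\, \frac{\partial L}{\partial w_e}(w^{(t)}) \tag{8}$$

where $\gamma_p(w, e)$ is given as

$$\gamma_p(w, e) = \left( \sum_{v_{in}[i] \cdots \xrightarrow{e} \cdots v_{out}[j]}\ \prod_{e_k \neq e} \left| w_{e_k} \right|^p \right)^{2/p} \tag{9}$$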

We call the optimization using the update rule (8) path-normalized gradient descent. When used in stochastic
settings, we refer to it as Path-SGD.

Now that we know Path-SGD is an approximate steepest descent with respect to the path-regularizer, we 
can ask whether or not this makes Path-SGD a rescaling invariant optimization method. The next theorem 
proves that Path-SGD is indeed rescaling invariant. 
Algorithm 1 Path-SGD update rule
  $\forall v \in V^0_{in}:\ \gamma_{in}(v) = 1$
  $\forall v \in V^0_{out}:\ \gamma_{out}(v) = 1$    ▷ Initialization
  for $i = 1$ to $d$ do
      $\forall v \in V^i_{in}:\ \gamma_{in}(v) = \sum_{(u \to v) \in E} \gamma_{in}(u)\, |w^{(t)}(u \to v)|^p$
      $\forall v \in V^i_{out}:\ \gamma_{out}(v) = \sum_{(v \to u) \in E} |w^{(t)}(v \to u)|^p\, \gamma_{out}(u)$
  end for
  $\forall (u \to v) \in E:\ \gamma_p(w^{(t)}, (u \to v)) = \left( \gamma_{in}(u)\, \gamma_{out}(v) \right)^{2/p}$
  $\forall e \in E:\ w_e^{(t+1)} = w_e^{(t)} - \frac{\eta}{\gamma_p(w^{(t)}, e)}\, \frac{\partial L}{\partial w_e}(w^{(t)})$    ▷ Update Rule

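Theorem 4.1. Path-SGD is rescaling invariant.

Proof. It is sufficient to prove that using the update rule (8), for any $c > 0$ and any $v \in V$, if $\tilde{w}^{(t)} = \rho_{c,v}(w^{(t)})$,
then $\tilde{w}^{(t+1)} = \rho_{c,v}(w^{(t+1)})$. For any edge $e$ in the network, if $e$ is neither an incoming nor an outgoing
edge of the node $v$, then $\tilde{w}(e) = w(e)$, and since the gradient is also the same for edge $e$ (as is $\gamma_p$, because the
factors $c$ and $1/c$ cancel along every path through $v$), we have $\tilde{w}^{(t+1)}_e = w^{(t+1)}_e$. However, if $e$ is an incoming
edge to $v$, we have that $\tilde{w}^{(t)}(e) = c\, w^{(t)}(e)$. Moreover, since the outgoing edges of $v$ are divided by $c$, we get
$\gamma_p(\tilde{w}^{(t)}, e) = \gamma_p(w^{(t)}, e) / c^2$ and $\frac{\partial L}{\partial w_e}(\tilde{w}^{(t)}) = \frac{1}{c} \frac{\partial L}{\partial w_e}(w^{(t)})$. Therefore,

$$\tilde{w}^{(t+1)}_e = c\, w^{(t)}_e - \frac{c^2 \eta}{\gamma_p(w^{(t)}, e)} \cdot \frac{1}{c} \frac{\partial L}{\partial w_e}(w^{(t)}) = c \left( w^{(t)}_e - \frac{\eta}{\gamma_p(w^{(t)}, e)} \frac{\partial L}{\partial w_e}(w^{(t)}) \right) = c\, w^{(t+1)}_e.$$

A similar argument proves the invariance of the Path-SGD update rule for outgoing edges of $v$. Therefore,
Path-SGD is rescaling invariant. □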
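
The statement can also be checked numerically on a toy problem. The sketch below is our own construction (squared loss, a single training point, positive weights so that every RELU stays active, hypothetical helper names); it compares the update of a rescaled network with the rescaling of the updated network, once for the path-normalized rule (8) and once for plain gradient descent.

```python
import random

def gradients(weights, x, y):
    # Backpropagation for the loss 0.5 * ||f(x) - y||^2, RELU hidden units, linear outputs.
    acts, pre = [list(x)], []
    for l, W in enumerate(weights):
        z = [sum(w * a for w, a in zip(row, acts[-1])) for row in W]
        pre.append(z)
        acts.append(z if l == len(weights) - 1 else [max(0.0, v) for v in z])
    delta = [o - t for o, t in zip(acts[-1], y)]
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = [[d * a for a in acts[l]] for d in delta]
        if l > 0:
            back = [sum(weights[l][j][i] * delta[j] for j in range(len(delta)))
                    for i in range(len(acts[l]))]
            delta = [b if z > 0 else 0.0 for b, z in zip(back, pre[l - 1])]
    return grads

def path_scaling(weights, p=2):
    # gamma_p(w, e) for every edge via one forward and one backward sweep.
    g_in = [[1.0] * len(weights[0][0])]
    for W in weights:
        g_in.append([sum(g * abs(w) ** p for g, w in zip(g_in[-1], row)) for row in W])
    g_out = [[1.0] * len(weights[-1])]
    for W in reversed(weights):
        nxt = g_out[0]
        g_out.insert(0, [sum(abs(W[j][i]) ** p * nxt[j] for j in range(len(W)))
                         for i in range(len(W[0]))])
    return [[[(g_in[l][i] * g_out[l + 1][j]) ** (2.0 / p) for i in range(len(W[0]))]
             for j in range(len(W))] for l, W in enumerate(weights)]

def update(weights, x, y, eta, path_normalized):
    grads = gradients(weights, x, y)
    scale = (path_scaling(weights) if path_normalized
             else [[[1.0] * len(row) for row in W] for W in weights])
    return [[[w - eta * g / s for w, g, s in zip(wr, gr, sr)]
             for wr, gr, sr in zip(W, G, S)] for W, G, S in zip(weights, grads, scale)]

def rescale(weights, layer, unit, c):
    new = [[row[:] for row in W] for W in weights]
    new[layer][unit] = [c * w for w in new[layer][unit]]
    for row in new[layer + 1]:
        row[unit] /= c
    return new

def max_rel_gap(a, b):
    return max(abs(u - v) / (1.0 + abs(u))
               for A, B in zip(a, b) for ra, rb in zip(A, B) for u, v in zip(ra, rb))

rng = random.Random(3)
sizes = [3, 4, 4, 2]
w = [[[rng.uniform(0.5, 1.5) for _ in range(sizes[l])] for _ in range(sizes[l + 1])]
     for l in range(len(sizes) - 1)]
x, y, eta, c = [1.0, 0.5, 2.0], [0.0, 0.0], 0.01, 10.0
w_tilde = rescale(w, 0, 1, c)

path_gap = max_rel_gap(rescale(update(w, x, y, eta, True), 0, 1, c),
                       update(w_tilde, x, y, eta, True))
gd_gap = max_rel_gap(rescale(update(w, x, y, eta, False), 0, 1, c),
                     update(w_tilde, x, y, eta, False))
print(path_gap, gd_gap)  # path_gap is at rounding level, gd_gap is not
```

For the path-normalized rule the two results coincide up to rounding, as the proof predicts, while for plain gradient descent they differ visibly.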


Efficient Implementation: As written, the Path-SGD update rule (8) needs to consider all
the paths, whose number is exponential in the depth of the network. However, it can be calculated in a time that is no
more than a forward-backward step on a single data point. That is, in a mini-batch setting with batch size $B$,
if the backpropagation on the mini-batch can be done in time $BT$, the running time of Path-SGD on the
mini-batch will be roughly $(B + 1)T$, a very moderate runtime increase with typical mini-batch sizes of
hundreds or thousands of points. Algorithm 1 shows an efficient implementation of the Path-SGD update
rule.

We next compare Path-SGD to other optimization methods in both balanced and unbalanced settings. 


5 Experiments 

In this section, we compare $\ell_2$-Path-SGD to two commonly used optimization methods in deep learning,
SGD and AdaGrad. We conduct our experiments on four common benchmark datasets: the standard MNIST
dataset of handwritten digits [8]; the CIFAR-10 and CIFAR-100 datasets of tiny images of natural scenes [?];
and the Street View House Numbers (SVHN) dataset containing color images of house numbers collected by
Google Street View [10]. Details of the datasets are shown in Table 1.

In all of our experiments, we trained feed-forward networks with two hidden layers, each containing
4000 hidden units. We used mini-batches of size 100 and a step-size of $10^{-\alpha}$, where $\alpha$ is an integer
between 0 and 10. To choose $\alpha$, for each dataset, we considered the validation errors over the validation set
(10000 randomly chosen points that are kept out during the initial training) and picked the value that reaches
the minimum error fastest. We then trained the network over the entire training set. All the networks were
trained both with and without dropout. When training with dropout, at each update step, we retained each
unit with probability 0.5.

Table 1: General information on datasets used in the experiments.

Data Set     Dimensionality             Classes   Training Set   Test Set
CIFAR-10     3072 (32 × 32 color)       10        50000          10000
CIFAR-100    3072 (32 × 32 color)       100       50000          10000
MNIST        784 (28 × 28 grayscale)    10        60000          10000
SVHN         3072 (32 × 32 color)       10        73257          26032

[Figure 2 panels, left to right: Cross-Entropy Training Loss, 0/1 Training Error, 0/1 Test Error]

Figure 2: Learning curves using different optimization methods for 4 datasets without dropout. Left panel displays
the cross-entropy objective function; middle and right panels show the corresponding values of the training and test
errors, where the values are reported on different epochs during the course of optimization. Best viewed in color.

[Figure 3 panels, left to right: Cross-Entropy Training Loss, 0/1 Training Error, 0/1 Test Error]

Figure 3: Learning curves using different optimization methods for 4 datasets with dropout. Left panel displays the
cross-entropy objective function; middle and right panels show the corresponding values of the training and test errors.
Best viewed in color.

We tried both balanced and unbalanced initializations. In the balanced initialization, incoming weights to
each unit $v$ are initialized to i.i.d. samples from a Gaussian distribution with standard deviation $1/\sqrt{\text{fan-in}(v)}$.
In the unbalanced setting, we first initialized the weights to be the same as the balanced weights. We then
picked 2000 hidden units randomly with replacement. For each unit, we multiplied its incoming edges and
divided its outgoing edges by 10c, where c was chosen randomly from a log-normal distribution.
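
The two initializations can be sketched as follows. Several details are assumptions on our part: the text does not give the parameters of the log-normal distribution (we use `mu=0.0`, `sigma=1.0`), we read the factor as the product of 10 and c (if a power of ten is intended, only the line computing `factor` changes), and the layer sizes and number of picks are scaled down from the 4000-unit layers and 2000 picks used in the experiments.

```python
import math
import random

def balanced_init(sizes, rng):
    # Gaussian weights with standard deviation 1/sqrt(fan-in) for every unit.
    return [[[rng.gauss(0.0, 1.0 / math.sqrt(sizes[l])) for _ in range(sizes[l])]
             for _ in range(sizes[l + 1])]
            for l in range(len(sizes) - 1)]

def unbalance(weights, n_picks, rng, mu=0.0, sigma=1.0):
    # Apply n_picks random rescalings to hidden units drawn with replacement.
    new = [[row[:] for row in W] for W in weights]
    hidden = [(l, j) for l in range(len(new) - 1) for j in range(len(new[l]))]
    for _ in range(n_picks):
        l, j = rng.choice(hidden)
        factor = 10.0 * rng.lognormvariate(mu, sigma)  # assumed reading of "10c"
        new[l][j] = [w * factor for w in new[l][j]]    # incoming edges
        for row in new[l + 1]:
            row[j] /= factor                           # outgoing edges
    return new

rng = random.Random(4)
sizes = [8, 16, 16, 4]  # tiny stand-in for the networks used in the experiments
w_bal = balanced_init(sizes, rng)
w_unb = unbalance(w_bal, n_picks=10, rng=rng)

# Per-unit l2 norms of the first hidden layer before and after unbalancing.
norms_bal = [math.sqrt(sum(w * w for w in row)) for row in w_bal[0]]
norms_unb = [math.sqrt(sum(w * w for w in row)) for row in w_unb[0]]
print(max(norms_bal) / min(norms_bal), max(norms_unb) / min(norms_unb))
```

Because only rescaling functions are applied, the product of weights along every input-to-output path is unchanged, so both initializations represent the same function.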

The optimization results without dropout are shown in Figure 2. For each of the four datasets, the plots
for objective function (cross-entropy), the training error and the test error are shown from left to right where 
in each plot the values are reported on different epochs during the optimization. Although we proved that 
Path-SGD updates are the same for balanced and unbalanced initializations, to verify that despite numerical 
issues they are indeed identical, we trained Path-SGD with both balanced and unbalanced initializations. 
Since the curves were exactly the same we only show a single curve. 

We can see that, as expected, the unbalanced initialization considerably hurts the performance of SGD
and AdaGrad (in many cases their training and test errors are not even in the range of the plot to be
displayed), while Path-SGD performs essentially the same. Another interesting observation is that even in
the balanced settings, not only does Path-SGD often get to the same value of objective function, training
and test error faster, but also the final generalization error for Path-SGD is sometimes considerably lower
than for SGD and AdaGrad (except CIFAR-100, where the generalization error for SGD is slightly better
compared to Path-SGD). The plots for test errors could also imply that implicit regularization due to steepest
descent with respect to the path-regularizer leads to a solution that generalizes better. This view is similar to
observations in [11] on the role of implicit regularization in deep learning.

The results for training with dropout are shown in Figure 3, where we suppressed the (very poor)
results on unbalanced initializations. We observe that, except for MNIST, Path-SGD converges much
faster than SGD or AdaGrad. It also generalizes better to the test set, which again shows the effectiveness
of path-normalized updates.

The results suggest that Path-SGD outperforms SGD and AdaGrad in two different ways. First, it can
achieve the same accuracy much faster, and second, the implicit regularization by Path-SGD leads to a local
minimum that can generalize better even when the training error is zero. This can be better analyzed by
looking at the plots for a larger number of epochs, which we have provided in Figures 4 and 5. We should also
point out that Path-SGD can be easily combined with AdaGrad to take advantage of the adaptive stepsize, or
used together with a momentum term. This could potentially perform even better compared to Path-SGD alone.

6 Discussion 

We revisited the choice of the Euclidean geometry on the weights of RELU networks, suggested an alternative
optimization method approximately corresponding to a different geometry, and showed that using such
an alternative geometry can be beneficial. In this work we show proof-of-concept success, and we expect
Path-SGD to be beneficial also in large-scale training for very deep convolutional networks. Combining
Path-SGD with AdaGrad, with momentum or with other optimization heuristics might further enhance
results.

Although we do believe Path-SGD is a very good optimization method, and is an easy plug-in for SGD,
we hope this work will also inspire others to consider other geometries, other regularizers and perhaps better
update rules. A particular property of Path-SGD is its rescaling invariance, which we argue is appropriate
for RELU networks. But Path-SGD is certainly not the only rescaling invariant update possible, and other
invariant geometries might be even better.

Finally, we choose to use steepest descent because of its simplicity of implementation. A better choice 
might be mirror descent with respect to an appropriate potential function, but such a construction seems 
particularly challenging considering the non-convexity of neural networks. 

Figure 4: Learning curves over a larger number of epochs using different optimization methods for 4 datasets without
dropout. Left panel displays the cross-entropy objective function; middle and right panels show the corresponding
values of the training and test errors, where the values are reported on different epochs during the course of optimization.
Best viewed in color.

Figure 5: Learning curves over a larger number of epochs using different optimization methods for 4 datasets with
dropout. Left panel displays the cross-entropy objective function; middle and right panels show the corresponding
values of the training and test errors. Best viewed in color.